Experiments in Authorship-Link Ranking and Complete Author Clustering
Authors
Abstract
This paper presents the approach we developed for the Authorship-Link Ranking and Complete Author Clustering task at the PAN 2016 competition. Given a document collection, the task is to group documents written by the same author, so that each cluster corresponds to a different author. The task can also be viewed as one of establishing authorship links between documents. We combine classification with agglomerative clustering, using a rich set of features such as average sentence length, function-word ratio, type-token ratio, and part-of-speech tags.
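As a rough illustration of the kind of features and clustering named in the abstract, the following Python sketch computes average sentence length, function-word ratio, and type-token ratio, then groups documents with a simple average-linkage agglomerative procedure. It is not the authors' implementation: the function-word list, the tokenizer, the Euclidean distance, the distance threshold, and all function names are assumptions made for illustration, and the classification step and part-of-speech features are omitted.

```python
import re
from itertools import combinations

# Tiny illustrative function-word list (an assumption, not the paper's list).
FUNCTION_WORDS = {"the", "a", "an", "of", "and", "to", "in", "that",
                  "it", "is", "was", "for", "on", "with", "as", "but"}


def stylometric_features(text):
    """Return (average sentence length, function-word ratio, type-token ratio)."""
    # Naive sentence split on terminal punctuation; real systems use a proper splitter.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens or not sentences:
        return (0.0, 0.0, 0.0)
    avg_sentence_len = len(tokens) / len(sentences)
    function_ratio = sum(t in FUNCTION_WORDS for t in tokens) / len(tokens)
    type_token_ratio = len(set(tokens)) / len(tokens)
    return (avg_sentence_len, function_ratio, type_token_ratio)


def euclidean(u, v):
    """Euclidean distance between two equal-length feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5


def agglomerative_clusters(vectors, threshold):
    """Average-linkage agglomerative clustering over feature vectors.

    Starts with one cluster per document and repeatedly merges the closest
    pair, stopping once the closest pair is farther apart than `threshold`
    (a hypothetical tuning parameter). Returns lists of document indices.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        total = sum(euclidean(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        a, b = min(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[a], clusters[b]) > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe: combinations guarantees a < b
    return [sorted(c) for c in clusters]
```

In this sketch each resulting cluster stands for one hypothesised author, and the number of clusters falls out of the threshold rather than being fixed in advance. The three features live on different scales, so a practical system would normalise them before computing distances.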
Similar resources
Author Clustering using Hierarchical Clustering Analysis
This paper presents our approach to the Author Clustering task at PAN 2017. We performed a hierarchical clustering analysis of different document features: typed and untyped character n-grams, and word n-grams. We experimented with two feature representation methods, the log-entropy model and tf-idf, while tuning minimum frequency threshold values to reduce dimensionality. Our system was ranke...
Author Clustering based on Compression-based Dissimilarity Scores
The PAN 2017 Author Clustering task examines two application scenarios: complete author clustering and authorship-link ranking. In the first scenario, one must identify the number (k) of different authors within a document collection and assign each document to exactly one of the k clusters, where each cluster corresponds to a different author. In the second scenario, one must establish auth...
Author Identification Based on a Hybrid Feature Set Using Machine Learning and Clustering Techniques
Author identification of a document can be performed using computational or statistical methods. In this paper, we try to identify the author of two ancient Arabic religious books dating from the 6th century: the Holy Quran and the Hadith. Authorship identification consists in identifying the author of an anonymous document by using some techniques of Natural Language Processing (NLP) and Arti...
Authoritative Re-Ranking in Fusing Authorship-Based Subcollection Search Results
We examine the use of authorship information to divide IR test collections into subcollections and apply techniques from the field of distributed information retrieval to enhance the baseline search results. We determine the expertise of each author, based on the content of their documents, and use this knowledge to construct rankings of the different author subcollections for each query. We go...
On co-authorship for author disambiguation
Author name disambiguation deals with clustering the same-name authors into different individuals. To attack the problem, many studies have employed a variety of disambiguation features such as coauthors, titles of papers/publications, topics of articles, emails/affiliations, etc. Among these, co-authorship is the most easily accessible and influential, since inter-person acquaintances represen...
Publication date: 2016